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1.  Introduction 

1.1.  Computational  Vision 

Computational  vision  denotes  a  new  field  in  artificial  intelligence  that  has  devel¬ 
oped  in  the  last  15  years.  Its  two  main  goals  are  to  develop  image  understanding 
systems  which  automatically  provide  scene  descriptions  from  real  images,  and 
to  understand  biological  vision.  Its  main  focus  is  on  theoretical  studies  of  vision, 
considered  as  an  information  processing  task. 

Since  at  least  the  work  of  David  Marr  (Marr,  1982;  see  also  Marr  and  Pog- 
gio,  1977),  it  has  been  customary  to  consider  vision  as  an  information  processing 
system  that  can  be  divided  into  several  modules  at  different  theoretical  levels, 
at  least  as  a  first  approximation.  In  particular,  Marr  suggested  that  the  goal  of 
the  first  step  of  vision  is  to  obtain  descriptions  of  physical  properties  of  three- 
dimensional  surfaces  around  the  viewer  such  as  distance,  orientation,  texture, 
and  reflectance.  This  first  step  of  vision,  up  to  what  has  been  called  2-1/2  D 
sketch  or  intrinsic  images ,  is  mainly  bottom-up,  relying  on  general  knowledge 
but  no  special  high-level  information  about  the  scene  to  be  analyzed. 

The  first  part  of  vision-from  images  to  surfaces-has  been  termed  early  vi¬ 
sion.  Although  this  point-of-view  has  been  embraced  widely  (see  a  set  of  recent 
reviews,  e.g.,  Brown,  1984;  Brady,  1981;  Barrow  and  Tannenbaum,  1981;  Pog- 
gio,  1984),  it  is  important  to  observe  that  its  correctness  is  still  to  be  proven.  In 
particular,  it  is  still  unclear  what  the  nature  of  the  2-1/2-D  sketch  representation 
is,  how  different  visual  modules  interact,  how  their  output  is  fused  and  what  is 
the  role  of  high-level  knowledge  on  early  visual  processes.  The  critical  problem 
of  the  organization  of  vision  and  of  the  control  of  the  flow  of  information  from 
the  different  modules  and  how  high-level  knowledge  is  used  is  still  very  much  an 
open  problem. 

In  this  paper,  we  do  not  consider  this  larger  issue.  Our  point-of-view  is  that 
a  rigorous  analysis  of  individual  modules  of  vision  is  bound  to  play  an  important 
role  in  any  full  theory  of  vision. 

\.2.  Early  Vision 

Early  vision  consists  of  a  set  of  processes  that  recover  physical  properties  of 
visibly  three-dimensional  surfaces  from  the  two-dimensional  images.  Computa¬ 
tional,  biological  and  epistemological  arguments  (see  Marr  and  Poggio,  1977) 
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•  Edge  Detection 

•  Spatio-temporal  interpolation  and  approximation 

•  Computation  of  optical  flow 

•  Computation  of  lightness  and  albedo 

•  Shape  from  contours 

•  Shape  from  texture 

•  Shape  from  shading 

•  Binocular  Stereo 

•  Structure  from  motion 

•  Structure  from  stereo 

•  Surface  reconstruction 

•  Computation  of  Surface  Color 


Table  1.  Examples  of  early  vision  processes. 


suggest  that  early  vision  processes  are  generic  ones  that  correspond  to  concep¬ 
tually  independent  modules  that  can  be  studied,  at  least  to  a  very  first  approx¬ 
imation,  in  isolation.  Table  1  shows  a  list  of  some  of  the  early  vision  modules. 

The  standard  definition  of  computational  vision  is  that  it  is  inverse  optics. 
The  direct  problein-the  problem  of  classical  optics-or  computer  graphics-is  to 
determine  the  images  of  three-dimensional  objects.  Computational  vision  is 
confronted  with  inverse  problems  of  recovering  surfaces  from  images.  Much  in¬ 
formation  is  lost  during  the  imaging  process  that  projects  a  three-dimensional 
world  into  two-dimensional  arrays  (images).  As  a  consequence,  vision  must  rely 
on  natural  constraints,  that  is,  general  assumptions  about  the  physical  world 
to  derive  an  unambiguous  output.  This  is  typical  of  many  inverse  problems  in 
mathematics  and  physics. 

In  fact,  the  common  characteristics  of  most  early  vision  problems,  in  a  sense 
their  deep  structure,  can  be  formalized:  early  vision  problems  are  ill-posed  in 
the  sense  defined  by  Hadamard.  A  problem  is  well-posed  when  its  solution  (a) 
exists,  (b)  is  unique  and  (c)  depends  continuously  on  the  initial  data.  Ill-posed 
problems  fail  to  satisfy  one  or  more  of  these  criteria. 

Brrtero,  Poggio  and  Torre  (1987)  show  precisely  the  mathematically  ill- 
posed  structure  of  several  problems  listed  in  Table  1  (see  also  Poggio  and  Torre, 


1984.)  The  recognition  that  early  vision  problems  are  ill-posed  suggests  imme¬ 
diately  the  use  of  regularization  methods  developed  in  mathematics  and  math¬ 
ematical  physics  for  solving  the  ill-posed  problems  of  early  vision  (Poggio  and 
Torre,  1984).  Without  an  explicit  connection  with  regularization  techniques, 
B.  Horn  (1986)  had  earlier  approached  several  problems  in  vision  from  a  very 
similar  point  of  view,  using  minimization  techniques  for  their  solution. 

1.3.  Standard  Regularization  in  Early  Vision 

The  main  idea  for  “solving”  ill-posed  problems  is  to  restrict  the  class  of  admis¬ 
sible  solutions  by  introducing  suitable  a  priori  knowledge.  In  standard  regu¬ 
larization  methods,  due  mainly  to  Tikhonov,  the  regularization  of  the  ill-posed 
problem  of  finding  z  from  the  data  y  : 

Az  =  y  (1) 

results  in  finding  z  that  minimizes 

\\Az  —  y\\2  +  A||Pz||2,  (2) 

where  A  is  a  so-called  regularization  parameter,  ||  •  ||  is  the  norm  and  ||Pz||  is 
called  a  stabilizing  functional.  In  standard  regularization  theory,  A  is  a  linear 
operator,  the  norms  are  quadratic  and  P  is  linear.  In  this  method,  A  controls  the 
compromise  between  the  degree  of  regularization  of  a  solution  and  its  closeness 
to  the  data  (the  first  term  in  equation  2).  P  embeds  the  physical  constraints 
of  the  problem.  It  can  be  shown  for  quadratic  variational  principles  that  under 
mild  conditions  the  solution  space  is  convex  and  a  unique  solution  exists. 

Poggio  et  al  (1984,  1985)  show  that  several  problems  in  early  vision  can  be 
“solved”  by  standard  regularization  techniques  and  Beitero  et  al.  (1987)  give 
a  rigorous  analysis.  Surface  reconstruction,  optical  flow  at  each  point  in  the 
image,  optical  flow  along  contours,  color,  and  stereo  can  be  computed  by  using 
standard  regularization  techniques.  Variational  principles  that  are  not  exactly 
quadratic  but  have  the  same  form  as  equation  2  can  be  used  for  other  problems 
in  early  vision.  The  main  results  of  Tikhonov  can,  in  fact,  be  extended  to  some 
cases  in  which  the  operators  A  and  P  are  nonlinear,  provided  they  satisfy  certain 
conditions  (Morozov,  1984.) 

Standard  regularization  methods  can  be  implemented  very  efficiently  by 
parallel  architectures  of  the  fine-grain  type,  such  as  the  Connection  Machine 
(Hillis,  1985).  Minimizing  formula  (2)  generates  a  convolution  operator  when 
certain  conditions  are  met  (Poggio  et  al.,  1986).  Analog  networks,  either  elec¬ 
trical  or  chemical,  can  also  be  a  natural  way  of  solving  the  variational  principles 
dictated  by  standard  regularization  theory  (Poggio  and  Koch,  1984,  1985).  A 
list  of  the  pioblems  that  can  be  regularized  by  standard  legularization  theon  or 
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Problem 

Edge  detection 
Optical  Flow 

(area  based) 
Optical  Flow 
(contour  based) 
Surface 

Reconstruction 

Spatiotemporal 

approximation 

Color 

Shape  from 

Shading 

Stereo 


Regularization  Principle 

/  [(5/  —  i)2  +  A(/Ir)2]dx 
/  (iru  +  iyv  +  it)2  +  A(u2  +  u2  +  t'l  +  u2)]  dxdy 

/[(VN-t^)2  +  A(AV)2]ds 

/  [(5  -f~d)2  +  A  {flt  +  2  fly  4-  flyf  dx  dy 

f  [(S /  -  I')2  +  A (V/  •  V  +  ft)2]dx  dy  dt 


I"  -  Az\2  +  X\\Pz  2 


f  (E  -  R(f,  g))2  +  A(/r2  +  fy 2  4-  ffr2  +  ffy2)  dx  dy 
/j  V2G  *  (l(x,y)  -  R(x  -f  d(i,y),y)j  +A(Vd)2jd 


Table  2.  Regularization  in  early  vision. 


slightly  non-linear  versions  of  it  are  listed  in  Table  2,  together  with  the  associated 
regularization  principle. 


1.4.  Limitations  of  Standard  Regularization  Theory 

This  new  theoretical  framework  for  early  vision  shows  clearly  not  only  the  at¬ 
tractions,  but  also  the  limitations,  that  are  intrinsic  to  the  standard  Tikhonov 
form  of  regularization  theory.  Standard  regularization  methods  lead  to  satisfac¬ 
tory  solutions  of  early  vision  problems  but  cannot  deal  effectively  and  directly 
with  a  few  general  problems  such  as  discontinuities  and  fusion  of  information 
from  multiple  modules. 

Standard  regularization  theory  with  linear  A  and  P  is  equivalent  to  re¬ 
stricting  the  space  of  solution  to  generalized  splines,  whose  order  depends  on 
the  order  of  the  stabilizer  P.  This  means  that  in  some  cases  the  solution  is  too 
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smooth,  and  cannot  be  faithful  in  locations  where  discontinuities  are  present. 
In  optical  flow,  surface  reconstruction  and  stereo,  discontinuities  are  in  fact  not 
only  present,  but  also  the  most  critical  locations  for  subsequent  visual  informa¬ 
tion  processing.  Standard  regularization  cannot  deal  well  with  another  critical 
problem  of  vision,  the  problem  of  fusing  information  from  different  early  vision 
modules.  Since  the  regularizing  principles  of  the  standard  theory  are  quadratic, 
they  lead  to  linear  Euler-Lagrange  equations.  The  output  of  different  modules 
can  therefore  be  combined  only  in  a  linear  way.  Terzopoulous  (1984;  see  also 
Poggio  et  al.,  1985)  has  shown  how  standard  regularization  techniques  can  be 
used  in  the  presence  of  discontinuities  in  the  case  of  surface  interpolation.  After 
standard  regularization,  locations  where  the  solution  /  originates  a  large  error 
in  the  second  term  of  equation  2,  are  identified  (this  needs  setting  a  threshold 
for  the  error  in  smoothness).  A  second  regularization  step  is  then  performed 
using  the  location  of  discontinuities  as  boundary  conditions. 

A  similar  method  could  be  used  for  fusing  information  from  multiple  sour¬ 
ces:  a  regularizing  step  could  be  performed  and  locations  where  terms  of  the  type 
of  the  first  term  of  equation  2  give  large  errors  would  be  identified.  A  decision 
step  would  then  follow  by  setting  appropriately  various  controlling  parameters  in 
those  locations,  therefore  weighting  in  an  appropriate  way  (for  instance,  vetoing 
some  of)  the  various  contributing  processes. 

One  would  like,  however,  a  more  comprehensive  and  coherent  theory  capa¬ 
ble  of  dealing  directly  with  the  problem  of  discontinuities  and  the  problem  of 
fusing  information.  So  the  challenge  for  a  regularization  theory  of  early  vision 
is  to  extend  it  beyond  standard  regularization  methods  and  their  most  obvious 
non-linear  versions. 

1.5.  Stochastic  Route  to  Regularizing  Early  Vision 

In  this  paper,  we  will  outline  a  rigorous  approach  to  overcome  part  of  the  ill- 
posedness  of  vision  problems,  based  on  Bayes  estimation  and  Markov  Random 
Field  models,  that  effectively  deals  with  the  problems  faced  by  the  standard 
regularization  approach.  In  this  approach,  the  a  prion  knowledge  is  r<  presented 
in  terms  of  an  appropriate  probability  distribution,  whereas  in  standard  regu¬ 
larization  a  priori  knowledge  leads  to  restrictions  on  the  solution  space  This 
distribution,  together  with  a  probabilistic  description  of  the  noise  tha'  corrupts 
the  observations,  allows  one  to  use  Bayes  theory  to  compute  the  posterior  distri¬ 
bution  Pf\9'  which  represents  the  likelihood  of  a  solution  /  given  the  observations 
g.  In  this  way,  we  can  solve  the  reconstruction  problem  by  finding  the  estimate 
/  which  either  maximizes  this  a  posteriori  probability  distribution  (the  so  called 
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Maximum  a  Posteriori  or  MAP  estimate),  or  minimizes  the  expected  value  (with 
respect  to  Pf\g)  of  an  appropriate  error  function.  The  class  of  solutions  that  can 
be  obtained  in  this  way  is  much  larger  than  in  standard  regularization.  In  par¬ 
ticular,  we  will  show  under  which  conditions  this  new  method  leads  to  solutions 
that  are  of  the  standard  regularization  type  (see  Section  3). 

The  price  to  be  paid  for  this  increased  flexibility  is  computational  complex¬ 
ity.  New  parallel  architectures  and  possibly  hybrid  computers  of  the  digital- 
analog  type  promise  however  to  deal  effectively  with  the  computational  require¬ 
ments  of  the  methods  proposed  here.  We  will  discuss  these  new  parallel  archi¬ 
tectures  in  some  detail  at  the  end  of  the  paper. 


2.  Probabilistic  Models 


In  the  rest  of  the  paper  we  are  concetrating  on  one  specific  problem  in  early 
vision,  the  problem  of  surface  reconstruction. 

The  key  to  the  success  in  the  use  of  the  probabilistic  approach  is  our  abil¬ 
ity  to  find  a  class  of  stochastic  models  (that  is,  random  fields)  that  have  the 
following  characteristics: 

The  probabilistic  dependencies  between  the  elements  of  the  field 
should  be  local.  This  condition  is  necessary  if  the  field  is  to  be 
used  to  model  surfaces  that  are  only  piecewise  smooth;  besides, 
if  it  is  satisfied,  the  reconstruction  algorithms  are  likely  to  be 
distributed,  and  thus,  efficiently  implementable  in  parallel  hard¬ 
ware. 

The  class  should  be  rich  enough,  so  that  a  wide  variety  of  quali¬ 
tatively  different  behaviors  can  be  modeled. 

The  relation  between  the  parameters  of  the  models  and  the  char¬ 
acteristics  of  the  corresponding  sample  fields  should  be  relatively 
transparent,  so  that  the  models  are  easy  to  specify. 

It  should  be  possible  to  represent  the  prior  probability  distribu¬ 
tion  Pf  explicitly,  so  that  Bayes  theory  can  be  applied. 

It  should  be  possible  to  specify  efficient  Monte  Carlo  procedures, 
both  for  generating  sample  fields  from  the  distribution,  so  that 
the  capability  of  the  model  to  represent  our  prior  knowledge  can 
be  verified,  and  to  compute  the  optimal  estimators. 
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A  class  of  random  fields  that  satisfies  these  requirements  is  the  class  of 
Markov  Random  Fields  (MRF’s)  on  finite  lattices  (see  Wong,  1968  and  Woods, 
1972).  A  MRF  has  the  property  that  the  probability  distribution  of  the  configu¬ 
rations  of  the  field  can  always  be  expressed  in  the  form  of  a  Gibbs  distribution: 

PfU)  =  (3) 

where  Z  is  a  normalizing  constant,  To  is  a  parameter  (known  as  the  “natural 
temperature”  of  the  field)  and  the  “Energy  function”  U(f )  is  of  the  form: 

w)  =  £’«/)  (4) 

c 

where  C  ranges  over  the  “cliques”  associated  with  the  neighborhood  system  of 
the  field,  and  the  potentials  Vc(f)  are  functions  supported  on  them  (a  clique  is 
either  a  single  site,  or  a  set  of  sites  such  that  any  two  sites  belonging  to  it  are 
neighbors  of  each  other). 


As  an  example,  the  behavior  of  piecewise  constant  functions  can  be  mod¬ 
eled  using  first  order  MRF  models  on  a  finite  lattice  L  with  generalized  Ising 
potentials  (Geman  and  Goman,  1984): 


Vc(fiJj) 


'  -1,  if  \i  —  j\  =  1  and  f,  = 
i  1,  if  I*  —  j\  =  1  and  /,  ±  fj 

k  0,  otherwise. 


/.  €  Qi  =  {gi,. .  .,?a/}  for  all  i  €  L 


(5) 


We  will  use  a  free  boundary  model,  so  that  the  neighborhood  size  for  a  given 
site  will  be:  4,  if  it  is  in  the  interior  of  the  lattice;  3,  if  it  lies  at  a  boundary,  but 
not  at  a  corner,  and  2  for  the  comers. 


The  Gibbs  distribution: 


PfU)  =  \  exP  [  - 


Uo  (/)  =  ][>(/„/,) 


(6) 


>■} 


defines  a  one  parameter  family  of  models  (indexed  by  7'0)  describing  piecewise 
constant  patterns  with  varying  degrees  of  granularity. 


We  will  assume  that  the  available  observations  <j  are  obtained  from  a  typical 
realization  /  of  the  field  by  a  degrading  operation  (such  as  sampling)  followed 
by  corruption  with  i.i.d.  noise  (the  form  of  whose  distribution  is  known),  so  that 
the  conditional  distribution  can  be  written  as: 


^f|/(!?;/)  =  exp(-nre)  (7) 

with  e  =  /Ci€s$i(/i  <7i),  where  {$*}  are  some  known  functions,  and  a  is  a 
parameter. 


i 


0 


The  posterior  distribution  is  obtained  from  Bayes  rule: 

Pf\g(f\  9)  =  exp  [  -  UP(f;  5)] 


Zn 


with 


Witf)  =  ^W) +  £*(/,  <7. ) 


(8) 


(9) 


«€S 


For  example,  in  the  case  of  binary  fields  (M  =  2)  with  the  observations  taken 
as  the  output  of  a  binary  symmetric  channel  (BSC)  with  error  rate  e  (Gallager, 
1975),  we  have: 

The  posterior  energy  reduces  to: 


To 


uj 


where  /,  <E  {91.52}: 


and 


a  =  In 


-  e),  for  gi  =  f, 
for  g{  ±  fi 

(10) 

j)  +  ~  *(/<  “  V')) 

i 

(11) 

if  a  —■  0 

otherwise. 

(12) 

1  -  A 

(12) 

3.  Cost  Functionals 


The  Bayesian  approach  to  the  solution  of  reconstruction  problems  has  been 
adopted  by  several  researchers.  In  most  cases,  the  criterion  for  selecting  the 
optimal  estimate  has  been  the  maximization  of  the  posterior  probability  (the 
Maximum  a  Posteriori  or  MAP  estimate).  It  has  been  used,  for  example,  by 
Gcman  and  Geman  (19S4)  for  the  restoration  of  piecewise  constant  images;  by 
Grenander  (1984)  for  pattern  reconstruction,  and  by  Elliot  et.  al.  (1983)  and 
Hansen  and  Elliot  (1982)  for  the  segmentation  of  textured  images  (a  similar 
criterion  —  the  maximization  of  a  suitably  defined  likelihood  function  —  has 
been  used  by  Cohen  and  Cooper  (1984)  for  the  same  purposes). 

In  some  other  cases,  a  performance  criterion,  such  as  the  minimization  of 
the  mean  squared  error  has  been  implicitly  used  for  the  estimation  of  particular 
classes  of  fields.  For  example,  for  continuous-valued  fields  with  exponential  au¬ 
tocorrelation  functions,  corrupted  by  additive  white  Gaussian  noise,  Nahi  and 
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Assefi  (1972)  and  Habibi  (1972)  have  used  causal  linear  models  and  optimal 
(Kalman)  linear  filters  for  solving  the  reconstruction  problem. 

The  minimization  of  the  expected  value  of  error  functionals,  however,  has 
not  been  used  as  an  explicit  criterion  for  designing  optimal  estimators  in  the 
general  case.  We  will  show  that  this  design  criterion  is  in  fact  more  appropriate 
in  our  case,  for  the  following  reasons: 

It  permits  one  to  adapt  the  estimator  to  each  particular  problem. 

It  is  in  closer  agreement  with  one’s  intuitive  assessment  of  the 

performance  of  an  estimator. 

It  leads  to  attractive  computational  schemes. 

As  an  example,  we  will  now  propose  design  criteria  for  two  particular  prob¬ 
lems:  image  segmentation  and  surface  reconstruction. 

Consider  a  field  /  with  N  elements  each  of  which  can  belong  to  one  of  a 
finite  set  Q ,  of  classes.  Let  fi  denote  the  class  to  which  the  ith  element  be¬ 
longs.  The  segmentation  problem  is  to  estimate  /  from  a  set  of  observations 
{flfi ,  • .  • ,  <7|>}-  Note  that  /,  does  not  necessarily  correspond  to  the  image  inten¬ 
sity.  It  may  represent,  for  example,  the  texture  class  for  a  region  in  the  image 
(as  in  Elliot  et.  al.,  1983),  etc. 

A  reasonable  criterion  for  the  performance  of  an  estimate  /  is  the  number  of 
elements  that  are  not  classified  correctly.  Therefore,  we  define  the  segmentation 
error  ea  as: 

N 

=  £(1  -6{fi  -/,)),  /,,/,  G  Q,.  (14) 

1=1 

In  the  case  of  the  reconstruction  problem,  an  estimate  /  should  be  considered 
“good’  if  it  is  close  to  /  in  the  ordinary  sense,  so  that  the  total  squared  error 

N 

tr (/,/)  =  £(/,  -/,)2  (15) 

1=1 

will  be  a  reasonable  measure  for  its  performance. 

To  derive  the  optimal  estimators  with  respect  to  the  criteria  stated  above, 
we  first  present  the  general  result  (which  can  be  found,  for  example  in  Abend. 
1968)  which  states  that  if  the  posterior  marginal  distributions  for  every  ele¬ 
ment  of  the  field  are  known,  the  optimal  Bayesian  estimator  with  respect  to 
any  additive,  positive  definite  cost  functional  C  may  be  found  by  independently 
minimizing  the  marginal  expected  cost  for  each  element 


iiaaMaaiaiaia^ 


v  V 
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In  more  precise  terms,  we  w  ill  muskier  cost  functionals  C(f,  f)  of  the  form: 


C’(/,/)  =  ^C, (/,./,) 

(10) 

>et. 

II 

=> 

II 

3- 

’  \  >  0  if  a  ^  6,  for  all  i. 

(17) 

We  will  assume  that  the  value  of  each  element  /,  of  the  field  /  is  constrained 
to  belong  to  some  finite  set  Q,  (the  generalization  to  the  case  of  compart  sets 
is  straightforward).  The  Optimal  Bayesian  estimator  f*  with  respect  to  the 
cost  functional  C  is  defined  as  the  global  minimizer  of  the  expected  value  of  C' 
over  all  possible  /  and  g.  One  can  prove  that  this  estimate  can  be  found  by 
minimizing  independently  the  marginal  expected  cost  for  each  element,  i.e.. 

/,*=<?  :  ^2  Ci(r,q)Pi(r\g)  <  ^  C,{r,  s)P,(r\g) 

r£Q,  reQ, 

for  all  s  7^  q,  and  for  all  ?  €  L  (IS) 

where  Pt{r\g)  is  the  posterior  marginal  distribution  of  the  clement  i: 

P.(r\g)  =  ^2  Pf\ g(f'il)  (19) 

!  fi=r 

The  optimal  estimators  for  the  error  criteria  defined  above  can  be  easily 
derived  from  this  result: 

In  the  case  of  the  segmentation  problem,  we  get  that 

/.*  =  q  <E  Q,  :  P,(q\g)  >  P,(s\g) 

for  all  s  ^  q.  (20) 

We  will  call  this  estimate  the  “Maximizer  of  the  Posterior  Marginals"  (/ mpm ). 

For  the  reconstruction  problem,  the  optimal  estimate  is: 

/,*  ~  g  --  Q,  :  (f.-g)2 

for  all  s  ^  q  (21 ) 

W<  will  call  tl.i>  e-rimate  the  “Thresholded  Posterior  Mean"  (/ i'p,\i  ). 

Ili'-  mam  obstacle  ft »j  t he  practical  application  of  these  results  lies  in  the 
?■  »i  malabii-  computational  cost  associated  with  the  exact  computation  of  the 
:..a:t;mah  and  the  mean  of  the  posterior  distribution  given  by  (9),  even  for  la t - 
'  •  •  •  of  moderate  si/,  fi,  the  next  section  we  will  present  a  general  distributed 
. . wo  '  hat  will  permit  u-  to  approximate  these  quantities  as  precisely  as  we 

want 


.  ,  •  ,  ,  ■  »  „  •  ,  «  %  *.  %  %  1  ■»  » 
*  -*  ^  M  ^  ^-*  w*  a  *|J  ^  m  ^ 
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4.  Algorithms 


The  algorithms  that  we  will  propose  are  based  on  the  use  of  the  Metropolis 
(Metropolis  et  al.,  1956)  or  Gibbs  Sampler  (Geman  and  Geman,  1984)  schemes, 
to  simulate  the  equilibrium  behavior  of  the  coupled  MRF  described  by  Equation 
(9).  We  recall  that  the  Markov  chain  generated  by  these  algorithms  is  regular, 
and  their  invariant  measure  is  the  posterior  distribution  P/\g.  The  law  of  large 
numbers  for  regular  chains  (see,  for  example,  Kemeny  and  Snell,  1960)  estab¬ 
lishes  that  the  fraction  of  time  that  the  chain  will  spend  on  a  given  state  /  will 
tend  to  Pf\g(f',  g)  as  the  number  of  steps  gets  large,  independently  of  the  initial 
state.  This  means  that  we  can  approximate  the  posterior  marginals  by: 

Pi(g\g) «  jr~  Kfil)  -  <i)  (22) 


and  /  by: 


/, 


n 


£/< 


(0 


-  k  ^ 

t=k 


(23) 


where  is  the  configuration  generated  by  the  Metropolis  algorithm  at  time  t, 
and  k  is  the  time  required  for  the  system  to  be  in  thermal  equilibrium.  From 
these  values,  / mpm  and  /tpm  can  be  easily  computed  using  (20)  and  (21). 


This  procedure  is  related  to  the  use  of  simulated  annealing  for  finding  the 
global  minimum  of  Up  (i.e.,  the  MAP  estimate:  see  Geman  and  Geman,  1984). 
In  our  case,  however,  we  are  interested  in  gathering  statistics  about  the  equi¬ 
librium  behavior  of  the  coupled  field  at  a  fixed  temperature  T  =  1,  rather  than 
in  finding  the  ground  state  of  the  system.  This  fact  gives  our  procedure  some 
distinct  advantages: 


1.  It  is  difficult  to  determine  in  general  the  descent  rate  of  the  temperature 
(annealing  schedule)  that  will  guarantee  the  convergence  of  the  annealing 
process  in  a  reasonable  time  (it  usually  involves  a  trial  and  error  procedure; 
though  Geman  and  Geman  (1984)  gave  theoretical  bounds  for  the  annealing 
schedule,  these  bounds  are  impractical).  Since  we  are  running  the  Metropo¬ 
lis  algorithm  at  a  fixed  temperature,  this  issue  becomes  irrelevant. 

2.  Since  in  our  case  we  are  using  a  Monte  Carlo  procedure  to  approximate 
the  values  of  some  integrals,  we  should  expect  a  nice  convergence  behavior, 
in  the  sense  that  coarse  approximations  can  be  computed  very  rapidly,  and 
then  refined  to  an  arbitrary  precision  (in  fact,  it  can  be  proved  (>ee  Feller, 
1950)  that  the  expected  value  of  the  squared  error  of  the  estimates  (22)  and 
(23)  is  inversely  proportional  to  n). 


Figure  1.  Panel  la)  repiesents  a  realization  of  a  (>4  x  (»-4  binary  Ising  net  with  free 
bouularie?:  panel  (b).  tin'  output  of  a  binary  symmetric  channel;  panel  (<  ).  the  MAP 
estii .  ate;  ami  panel  (d).  .  u  approximation  to  the  MPM  estimate. 

Tito  main  disadvantage  of  this  procedure  is  that  in  the  case  of  the  segmen¬ 
tation  problem,  a  large  .amount  of  memory  might  be  required  if  the  number  of 
classes  per  element  in  is  large  (we  need  to  store  the  .V (in  —  1)  numbers  that 
define  the  posterior  marginals). 

With  respect  to  the  relative  performance,  we  point  out  that  in  many  cases, 
particularly  for  high  signal  to  noise  ratios,  the  MAP  estimate  is  usually  close 
to  the  optimal  one.  If  the  noise  level  is  high,  however,  the  difference  in  the 
performances  of  the  two  estimators  may  be  dramatic.  This  is  illustrated  in  the 
example  portrayed  in  Figure  1.  Panel  (a)  represents  a  typical  realization  of  a 
G4  x  G4  binary  Ising  net  with  free  boundaries,  using  a  value  of  To  =  1.74  (0.75 
times  the  critical  temperature  of  the  infinite  lattice),-  panel  (b),  the  output  of 
a  binary  symmetric  channel  with  error  rate  <=  =  0.4:  panel  (c)  the  MAP  esti¬ 
mate.  and  panel  ( d)  an  approxiinat i<»n  to  the  MPM  estimate  (  which  we  will  label 
"MPM  (  M.C. )"  )  obtained  using  the  Metropolis  algorithm  and  Equation  i  10 )  to 
estimate  the  posterior  density.  The  corresponding  values  of  the  posterior  energy 
L  />  (Equation  (22))  and  the  relative  segmentation  error  (<  ,/G4‘)  are  shown  on 
Table  3. 

It  is  clear  that  the  approximation  to  the  MPM  estimates  shown  in  panel 
(d)  is  better  than  the  MAP  from  almost  any  viewpoint. 


/ 

9 

/map 

/mpm  (m-c-) 

/mpm  (Det  ) 

Energy 

-5594.8 

-226.0 

-6660.9 

-6460.0 

-6247.0 

Seg.  Error 

— 

0.4 

0.33 

0.128 

0.124 

Table  3. 

An  intuitive  explanation  for  this  behavior  comes  from  the  fact  that  the 
MAP  estimator  is  implicitly  minimizing  the  expected  value  of  a  cost  functional 
/)  which  is  equal  to  zero  only  if  fi  =  /,  for  all  i,  and  is  equal  to, 
say,  M  otherwise.  If  the  signal  to  noise  ratio  is  sufficiently  high,  the  expected 
value  of  the  optimal  segmentation  error  will  be  very  close  to  zero,  so  that  / mpm 
and  / map  will  coincide.  In  a  high  noise  situation,  however,  the  MAP  estimator 
will  tend  to  be  too  conservative,  since  from  its  viewpoint  it  is  equally  costly  to 
make  one  or  one  thousand  mistakes.  The  MPM  estimator,  in  contrast,  can  make 
a  better  (although  more  risky)  guess,  since  making  a  few  mistakes  has  only  a 
marginal  effect  on  the  expected  cost. 

A  quantitative  comparison  of  the  performances  of  the  MAP  and  MPM  es¬ 
timators,  with  respect  to  the  segmentation  error,  can  be  obtained  using  the 
ratio: 

r  =  tMAP  _ 

&TPM 

=  E f,g  exP  [  ~  tW;  g)] e«  (/>  fMAp(g)) 

E/,9  exP  [  -  UP{f\  g)]e,(fJTPM{g)) 

In  Figure  2  we  show  a  plot  of  the  ratio  r  for  a  2  x  2  lattice,  for  different 
values  of  the  error  rate  t  and  the  natural  temperature  T0.  As  expected,  r  is 
never  less  than  1.  In  the  worst  case  (for  e  =  0.1  and  To  =  0.2)  the  error  of  the 
MAP  estimate  is  1.17  times  that  of  the  MPM  estimate:  if  Tq  is  not  too  small 
and  t  is  not  too  large,  both  estimates  coincide,  and  as  e  approaches  0.5  (low 
signal  to  noise  ratio),  the  MPM  estimate  is  consistently  better  than  the  MAP. 
An  experimental  analysis  of  larger  lattices  reveals  a  similar  qualitative  behavior, 
but  the  values  of  r  are  much  larger  in  this  case  (see  Table  3). 


5.  Examples  of  Applications  in  Vision 

5.1.  Reconstruction  of  Piecewise  Constant  Functions 


The  efficient  solution  of  this  problem  is  iclevant  for  several  reasons:  binary  im- 


Figure  2.  Ratio  of  the  average  errors  of  the  MAP  and  MPM  estimators  for  a  2  X  2 
Ising  net. 

ages  (or  images  consisting  of  only  a  few  grey  levels)  are  directly  useful  in  many 
interesting  applications  (for  example,  object  recognition  and  manipulation  in 
restricted  (industrial)  environments);  besides,  several  perceptual  problems,  such 
as  the  segmentation  of  textured  images  (Elliot,  et.  al.  (1983);  Hansen  and  El¬ 
liot  (1982);  Cohen  and  Cooper  (1984)),  or  the  formation  of  perceptual  clusters 
(Marroquin  (1985))  can  be  reduced  to  the  problem  of  reconstructing  a  piecewise 
constant  surface. 

The  prior  model  for  this  kind  of  functions  is  given  by  equations  (2)  and  (6), 
and  the  posterior  distribution  by  Equation  (8).  If  the  parameters  that  character¬ 
ize  the  system  (namely,  the  “natural  temperature”  To  and  the  noise  parameter 
a)  are  known,  the  MPM  estimator  produces  excellent  results,  such  as  the  one 
illustrated  in  Figure  1. 

5.1.1.  Parameter  estimation 

In  most  practical  cases,  however,  we  are  only  given  the  noisy  observations  g  and 
general  qualitative  information  about  the  structure  of  the  field  and  the  noise,  so 


(a) 


(b) 


(c) 


Figure  3.  (a)  Original  ternary  MRF.  (b)  Noisy  observations  (additive  Gaussian  noise), 
(c)  Optimal  (maximum  likelihood)  estimate. 


that  /,  the  noise  parameter  a  (which  is  a  function  of,  for  example,  the  error  rate 
e  when  the  noise  corruption  corresponds  to  a  Binary  Symmetric  Channel  (BSC), 
or  of  the  variance  <x2,  in  the  case  of  additive,  white  Gaussian  noise),  and  the 
interaction  parameter  T0  have  to  be  simultaneously  estimated.  One  plausible 
approach  (Geman,  1985)  is  to  use  the  “EM  algorithm”  (Dempster  et  al.,  1977) 
to  find  a  value  for  the  parameters  that  maximizes  locally  the  likelihood  function: 

L(a,T0)  =  ]QgP(g\a,T0)  (25) 

(see  Note  [1]).  For  example,  for  a  noise  model  that  corresponds  to  a  BSC  with 
error  rate  e,  the  EM  algorithm  takes  the  following  form.  We  start  with  some 
estimates  o(0),To(0)  for  the  parameters.  The  pth  iteration  (for  p  =  1,2,...) 
consists  of  two  steps. 

Expectation  (E-step):  Find  the  conditional  estimates  for  Uo  and  e  (see 
Equations  (6)  and  (9)). 

U0(p)  =  E[U0\gMp),T0{p)]  (26) 

e(p)  =  E  [%,  a(p),  To(p)]  (27) 

These  estimates  (which  are  ensemble  averages  taken  with  respect  to  the 
posterior  distribution  Pf\g\  see  Equation  (8))  can  be  approximated  using  the 
Monte  Carlo  procedure  described  in  Section  4,  i.e.,  computing  time  averages  of 
the  associated  ergodic  (“Gibbs”)  chain,  generated  using  the  posterior  energy  Up. 

Maximization  (M-step):  Find  T0(p  4-  l),o(p  +  1)  such  that: 

£[C„|a(p+l),r0(p+l)]  =Uo(ph 


(28) 


17 


Note  that  the  left  hand  side  of  the  above  expansions  -  the  unconditional  expec¬ 
tations  of  the  suffic  ient  statistics  U o  and  i  -  are  independent  of  the  data.  Thus, 
the  function 

E[Uo\at,  To]  =  E[Uo\T0)  =  *(T„)  (30) 

can  be  pre-computed  for  any  given  lattice  size,  using  again  the  Monte  Carlo 
procedure  of  Section  4,  but  this  time  with  the  prior  energy  Uo  instead  of  Up 
(in  Figure  3  we  show  this  function  for  a  64  x  64  binary  Ising  lattice  with  free 
boundaries).  In  this  way, 

TqP+X)  =  (31) 

can  be  computed  directly  using  a  table  look-up  procedure.  0(p+J)  can  be  com¬ 
puted  directly  from  the  noise  statistic  t(p)\  for  example,  for  a  BSC  we  have: 


,  i(p) 

a,p+1,  -  '“rnsr-  <32> 

It  can  be  shown  that  this  algorithm  will  eventually  converge  towards  a  fixed 
point  which  will  be  a  local  maximum  of  L(a,T0).  Its  use,  however,  has  some 
problems: 

Firstly,  we  note  that  each  iteration  of  the  algorithm  -  in  particular,  the  E- 
step  -  may  require  a  relatively  large  number  of  iterations  of,  say,  the  Metropolis 
algorithm.  Since  the  updated  values  of  the  parameters  (a(p+  l),7o(p  +  1)) 
are  not  necessarily  close  to  the  old  ones  (a(p),  To(p)),  the  Gibbs  chain  has  to 
be  allowed  to  reach  equilibrium  each  time,  before  starting  to  compute  the  cor¬ 
responding  time  averages,  which  makes  the  whole  procedure  computationally 
expensive.  Besides,  the  likelihood  function  L  will,  in  general,  be  multimodal, 
so  that  the  final  result  may  depend  strongly  on  the  choice  of  the  initial  values 
q(0),To(0)  (in  fact,  it  may  be  necessary  to  make  several  runs  with  different 
starting  points).  Finally,  we  have  found  experimentally  that  the  performance  of 
this  estimator  deteriorates  drastically  as  the  SNR  falls  below  a  certain  level  (for 
example,  when  the  error  rate  exceeds  0.25,  for  To  =  Tc  (the  critical  temperature 
of  the  infinite  lattice)). 

For  these  reasons,  we  have  used  a  different  approach,  which  is  computation¬ 
ally  more  efficient,  and  has  better  experimental  performance.  The  basic  idea  is 
to  use  some  statistics  computed  from  the  data  to  constrain  the  space  of  plau¬ 
sible  values  for  the  estimates  to  a  smooth  curve.  In  this  way  we  can  perform 
an  exhaustive  search  for  the  global  minimum  of  an  appropriate  merit  function 
by  varying  continuously  the  values  of  the  parameters,  so  that  the  equilibrium  of 
the  Gibbs  chain  is  maintained. 


To  illustrate  this  idea,  we  consider  the  case  of  a  binary  Ising  field  where 
the  noise  corruption  corresponds  to  a  BSC  (the  idea  can  be  easily  extended  to 
M-ary  Ising  fields  and  other  noise  models  -  see  Note  [3]). 

We  define  the  statistic  Ug  as: 

Ug  =  Vi,iV(gitgj)  (33) 

where  V  is  an  Ising  potential  (see  Section  2).  If  the  error  rate  of  the  channel  is 
e,  we  have  that 

E[Ug\a,T0]  ||  E[U0\a,To](4e2  -  4e  +  1)  =  9(T0)( 4e2  -  4e  +  1)  (34) 

If  we  make  the  assumption  that 

E[Ug\a,T0}  =  Ug  (35) 

where  U g  is  the  observed  statistic  (see  Note  [2]),  we  can  constrain  the  search  for 
the  estimates  a,  T0  to  the  curve  given  by  the  equations: 

i  =  1/2(1  -  (F,/*(f„»,/*| 


6  =  ,n(rb) 


(36) 


As  a  merit  function  we  have  used  the  expected  degree  of  uniformity  in  the  spa¬ 
tial  distribution  of  the  residuals.  In  particular,  we  define  a  likelihood  function 
L  by  covering  the  lattice  with  a  set  of  m  non-overlapping  squares  (say,  8  pixels 
wide);  computing  the  relative  variance  of  the  noise  parameter,  estimated  over 
each  square,  and  adding  all  these  terms  together: 

m 


I(a,T„)  =  -£(^)  , 

eo  ’ 


;37) 


where  Co  and  i}  are  the  (conditional)  expected  values  of  the  noise  density  over 
the  whole  lattice,  and  over  the  jth  square,  respectively,  and  can  be  approximated 
as  time  averages  of  the  corresponding  Gibbs  chain  (using  the  posterior  energy 
Up  and  a  and  To  as  parameters). 


The  optimal  estimate  for  (o,T0)  can  now  be  obtained  as  the  global  min- 
imizer  of  L  over  the  curve  (36).  Note  that  if  To  (and  hence  o)  are  varied 
slowly  enough,  so  that  the  associated  Gibbs  chain  is  maintained  approximately 
in  equilibrium,  the  computational  cost  of  this  search  will  be  equivalent  to  that 
of  a  single  “simulated  annealing"  experiment. 

This  estimation  algorithm  allows  us  to  reconstruct  a  pattern  /  from  the 
noisy  observations  g  without  having  to  adjust  any  free.  paramete.Ts.  The  only 
prior  assumptions  correspond  to  the  qualitative  structure  of  the  field  f  (first 
order,  isotropic  MRF)  and  to  the  nature  of  the  noise  process.  In  practice,  this 
means  that  we  can  apply  it  to  restore  any  piecewise  uniform  image  with  uniform 
granularity,  even  if  it  lias  not  been  generated  by  a  Markov  process.  \Y«  have  used 
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this  algorithm  to  reconstruct  a  variety  of  binary  images  with  excellent  results.  In 
Figure  4  we  show  such  a  restoration.  The  observations  (b)  were  generated  from 
the  synthetic  image  (a)  with  an  actual  error  rate  of  0.35  (assumed  unknown). 
The  optimal  estimate  for  /  is  shown  in  (c).  It  should  be  noted  at  this  point 
that  for  particular  problems,  it  may  be  possible  to  find  more  efficient  schemes. 
Thus,  for  binary  fields  sent  through  a  BSC,  we  have  developed  a  very  efficient 
(deterministic)  procedure  for  approximating  the  MPM  estimator  for  /  which 
can  also  be  used  to  find  the  optimal  estimates  for  a  and  T0  (see  Marroquin, 
1985  for  details). 


5.2.  Reconstruction  of  Piecewise  Continuous  Functions. 


In  this  section  we  will  illustrate  the  application  of  the  local  spatial  interaction 
models  and  estimation  techniques  that  we  have  described  to  the  reconstruc¬ 
tion  of  piecewise  continuoiis  functions  from  noisy  observations  taken  at  sparse 
locations. 

In  this  reconstruction,  it  will  be  important  not  only  to  interpolate  smooth 
patches  over  uniform  regions,  but  to  locate  and  preserve  the  discontinuities  that 
bound  these  regions,  since  very  often  they  are  the  most  important  parts  of  the 
function.  They  may  represent  object  boundaries  in  vision  problems  (such  as 
image  segmentation;  depth  from  stereo;  shape  from  shading;  structure  from 
motion,  etc.);  geological  faults  in  geophysical  information  processing,  etc. 
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As  we  mentioned  in  Section  1.4,  an  approach  to  this  problem  (see  Terzopou- 
los  (1984))  consist  of  first  interpolating  an  everywhere  smooth  function  over  the 
whole  domain,  then  applying  some  kind  of  discontinuity  detector  (followed  by 
a  thresholding  operation)  to  try  to  find  the  significant  boundaries,  and  finally, 
reinterpolating  smooth  patches  over  the  continuous  subregions. 

The  results  that  have  been  obtained  with  this  technique,  however,  are  not 
completely  satisfactory.  The  main  problem  is  that  the  task  of  the  discontinuity 
detector  is  hindered  by  the  previous  smooth  interpolation  operation.  This  be¬ 
comes  critical  when  the  observations  are  sparsely  located,  since  in  this  case,  the 
discontinuities  may  be  smeared  in  the  interpolation  phase  to  such  a  degree  that 
it  may  become  impossible  to  recover  them  in  the  detection  phase. 

In  contrast,  in  the  Bayesian  approach,  the  boundary  detection  and  interpo¬ 
lation  tasks  are  performed  at  the  same  time.  To  apply  the  general  reconstruction 
algorithms  developed  above  to  this  problem,  the  main  issue  is  the  representation 
of  the  concept  of  “piecewise  continuity”  in  the  form  of  a  prior  Gibbs  distribution 
in  a  meaningful  way. 

A  flexible  construction  involves  the  use  of  two  coupled  MRF  models:  one 
to  represent  the  function  (the  “surface”)  itself,  and  another  to  model  the  curves 
where  the  field  is  discontinuous.  A  coupled  model  of  this  kind  was  first  used  by 
Geman  and  Geman  (1984)  in  the  context  of  the  restoration  of  piecewise  constant 
images.  Terzopouloe  (1985)  has  recently  attempted  to  translate  this  idea  in  the 
continuous  and  deterministic  framework  of  regularization. 

This  model  can  be  adapted  to  our  problem  by  modifying  the  choice  of  the 
potentials  and  the  neighborhood  structure  of  the  coupled  MRF’s.  Specifically, 
the  following  modifications  are  needed: 

1.  Since  in  our  case  the  observations  are  sparse,  it  becomes  necessary 
to  expand  the  size  of  the  neighborhoods  of  the  line  field,  to  prevent  the  forma¬ 
tion  of  “thick”  boundaries  between  the  smooth  patches  (i.e.,  adjacent,  parallel 
segments  of  active  lines  in  these  regions).  In  particular,  we  propose  that  the 
dual  lattice  be  8-connected,  with  non-zero  potentials  for  the  cliques  of  the  form 
illustrated  in  Figure  5  (a)  and  (b).  The  inclusion  of  the  cliques  of  Figure  5-b  has 
the  additional  advantage  of  penalizing  the  occurrence  of  sharp  turns,  permitting 
us  to  model  the  formation  of  piecewise  smooth  boundaries  using  a  binary  line 
process  instead  of  the  4-valued  process  proposed  by  Geman  and  Geman.  The 
potentials  for  these  cliques  are  computed  in  the  following  way: 

Let  Va ,  Vj  denote  the  potentials  associated  with  the  cliques  Ca,  Cf,  of  Figure 
5  (a)  and  (b),  respectively,  and  let  S*  (k  t  {«,/>})  denote  the  number  of  line 
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Figure  5.  Cliques  for  the  line  process. 


elements  belonging  to  C'k  that  are  “on”  at  a  given  time,  i.e., 

sk=Y,  l<  *  k  =  a'b  w 

The  potentials  \\  are  given  by: 

Vie  =  0<t>k(Sk)  ,  k  =  a,b  (39) 

where  /3  is  a  constant,  and  the  functions  <f>k  are  defined  by  the  following  tables: 


Sa  0  1  2  3  4 

4>a  0  0.4  0.25  1.2  2.0 


Si,  0  1  2 
fa  0  0  10 


It  is  not  difficult  to  see  that  this  choice  of  potentials  will  effectively  discour¬ 
age  both  the  formation  of  thick  boundaries  (Sj  =  2)  and  the  presence  of  sharp 
turns  (Sa  —  3  and/or  S(,  =  2). 


2.  The  potentials  of  the  depth  process,  which  is  now  continuous  valued, 
have  to  be  modified  to  express  the  morp  relaxed  condition  of  piecewise  continuity 
(instead  of  piecewise  constancy).  Specifically,  we  propose: 

(40) 

Note  that  ltJ  6  {0,  1}. 

3.  Unlike  the  case  of  piecewise  constant  surfaces,  we  now  have  to  worry 
about  the  maximum  absolute  difference  in  the  values  of  two  adjacent  depth  sites 
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that  we  are  willing  to  consider  as  a  “smooth”  gradient  (and  not  a  discontinuity). 
This  value,  which  in  general  is  problem -dependent,  determines  the  magnitude 
of  the  constant  /?  in  Equation  (6),  which  can  be  interpreted  as  the  coupling 
strength  between  the  two  processes. 


Assuming  that  the  observations  are  corrupted  by  i.i.d.  Gaussian  noise,  we 
get  the  following  expression  for  the  posterior  energy: 

Vp(f,hg)  =  y £(/<  -  *)2(i  -  U,)+ 

°  iJ 

+5E  X>  -  erf  +  E  V‘W  +  E  (41) 

i€S  Ca  C6 

where  S  is  the  set  of  sites  where  an  observation  is  present.  As  a  performance 
criterion  we  will  use  a  mixed  cost  functional  of  the  form 

'« (/,',/.')  =  E(/.-  frf  +  E(J  -  4‘<0  -  ',)).  (42) 

where  L/,Li  denote  the  depth  and  line  lattices,  respectively.  This  error  criterion 
means  that  the  reconstructed  surface  should  be  as  close  as  possible  to  the  true 
(unknown)  surface,  and  that  we  should  commit  as  few  errors  as  possible  in  the 
assertions  about  the  presence  or  absence  of  discontinuities. 


Applying  the  results  of  section  3,  we  find  that  the  optimal  estimators  will 
be  the  posterior  mean  for  /  and  the  maximizer  of  the  posterior  marginals  for  l. 


There  is  one  serious  difficulty  that  prevents  us  from  applying  directly  the 
general  Monte  Carlo  procedure  that  was  derived  above  to  the  computation  of 
these  optimal  estimates:  since  the  depth  variables  are  continuous-valued,  if  we 
discretize  them  finely  enough  to  guarantee  sufficient  precision  of  the  results, 
the  computational  complexity  of  either  the  Metropolis  or  Gibbs  Sampler  algo¬ 
rithms  will  be  very  large.  One  way  around  this  difficulty  is  to  note  that  for  any 
fixed  configuration  of  the  line  field,  the  posterior  energy  becomes  a  non-negative 
definite  quadratic  form 

U(f\l,g)=  (/•  ~  fjf  +  ~  9j}2  +  h  (43) 

0  j€S 

where  a  and  K  are  constants  (note  that  the  first  sum  is  taken  only  over  those* 
pairs  of  sites  whose  connecting  line  element  is  “off,”  and  the  second  one  over 
the  set  S ).  This  means  that  the  posterior  distribution  of  the  depth  field  is  con¬ 
ditionally  Gaussian,  so  that,  for  any  fixed  /,  we  can  find  the  optimal  conditional 
estimator  /*  as  the  minimizer  of  (25). 


Let  us  define  the  set  F*  as 

F*  -  {(/,/) 


/  =  /;}■ 


<441 


*  "  4  *  %"*  -v’  *,*  .V 


Ak  It  .iV-i 


Figure  6.  Hybrid  network  implementing  the  surface  reconstruction  algorithm  of  Sec¬ 
tion  4.  The  voltage  at  every  node  represents  the  height  of  the  surface.  Inside  every 
rectangular  box  there  is  a  resistance  of  unit  magnitude  and  a  switch  whose  state  is 
controlled  by  the  corresponding  line  element  (see  text). 


It  is  clear  that,  if  /,  /  are  the  optimal  estimates  for  our  problem,  we  have  that 

(/,0er,  (45) 

which  suggests  that  we  can  constrain  the  search  for  the  optimal  estimators  to 
this  set.  This  can  be  done,  in  principle,  by  replacing  the  posterior  energy  with 
the  function 

u*(i)  =  u(f;,i)  (46) 

(which  depends  only  on  /),  and  use  the  standard  Monte  Carlo  procedures  to  find 
the  optimal  estimator  l.  To  illustrate  this  idea,  let  us  consider  a  physical  model 
in  the  next  section. 


5.2.1.  Hybrid  Parallel  Computers 


It  is  well  known  that  the  steady  state  of  an  electrical  network  that  contains  only 
(current  or  voltage)  sources  and  linear  resistors  will  be  the  global  minimizer  of 
a  quadratic  functional  that  corresponds  to  the  total  power  dissipated  as  heat 


24 


(Oster  et  al,  1971).  It  is  therefore  possible  to  construct  an  analog  network  that 
will  find  the  equilibrium  state  of  the  depth  field  for  a  given,  fixed  configuration  of 
the  line  process,  i.e.,  that  will  minimize  the  conditional  energy  (13)  (see  Poggio 
and  Koch,  1984;  also  Poggio  et  al.,  1985).  This  suggests  a  hybrid  computational 
scheme  in  which  the  line  field  (whose  state  is  updated  digitally,  using,  say,  the 
Metropolis  or  Gibbs  Sampler  algorithms)  acts  as  a  set  of  switches  on  the  con¬ 
nections  between  the  nodes  of  the  analog  network  whose  voltages  represent  the 
depth  process.  In  particular,  if  /,  represents  the  voltage  at  node  i,  the  hybrid 
network  can  be  represented  as  a  4-connected  lattice  of  nodes  (see  Figure  6)  in 
which: 

A  resistance  (of  unit  magnitude)  and  a  switch  (controlled  by  the  line  ele¬ 
ment  ltJ)  is  present  in  every  link  between  pairs  i,  j  of  adjacent  nodes. 

If  an  observation  g,  is  present  at  site  t,  a  current  of  magnitude  equal  to  ag, 
is  injected  to  the  corresponding  node,  which  must  also  be  connected  to  a 
common  ground  via  a  resistance  of  magnitude  1/a  (see  Equation  13). 

A  direct  application  of  Kirchoff  current  law  shows  that  at  each  node  i  of 
this  network  we  will  have 

(/»'  ~  /^K1  ~  l') )  +  °9./.  =  aq,gt,  (47) 

}  €  As 

which  corresponds  to  the  condition 

grad  U(f\l)  =  0,  (48) 

so  that  the  equilibrium  configuration  coincides  with  /,*. 

This  scheme  can  be  used,  in  principle,  to  construct  a  special  purpose  hybrid 
computer  for  the  fast  solution  of  problems  of  this  type.  In  a  digital  machine,  the 
exact  implementation  of  this  strategy  will  in  general  be  computationally  very 
expensive,  since  /*  must  be  computed  every  time  a  line  site  is  updated.  It  is 
possible,  however,  to  develop  approximations  which  have  an  excellent,  experi¬ 
mental  performance,  and  lead  to  efficient  implementations  ( Marroquin,  1985). 
The  performance  of  this  method  is  illustrated  in  Figure  7.  in  which  we  show: 
(with  height,  coded  by  grey  level)  the  observations  (a);  the  initial  state  of  the 
network  (with  all  the  lines  turned  "off" )  (b);  the  final  reconstructed  surface  (c), 
and  the  boundaries  found  by  the  algorithm  (d),  for  a  square  at  height  2.0  over 
a  background  at  constant  height  --  1.0. 
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6.  Signal  Matching 


In  all  the  estimation  problems  we  have  studied  so  far,  the  posterior  energy  func¬ 
tion  had  the  form 

W;  <7)  =  W) +  £$,(/,,  <7.),  (49) 


where  Uo(f)  corresponded  to  the  MRF  model  for  the  field  /.  The  functions 
whose  precise  form  depended  on  the  particular  noise  model,  were  non-decreasing 
functions  of  the  distance  between  /,  and  </,. 

There  are  some  cases,  however,  when  the  conditional  probability  distribu¬ 
tion  of  the  observations  Pg\f{g\  f)  is  multimodal  (as  a  function  of  /)  which  causes 
the  functions  <5,  to  be  non-monotonic,  so  that  the  solution  to  the  problem  re¬ 
mains  ambiguous,  even  if  the  observations  are  dense,  and  the  signal  to  noise 
ratio  arbitrarily  high.  To  illustrate  this  situation,  we  will  study  an  important 
instance  of  it:  the  “signal  matching”  problem,  whose  one-dimensional  version 
is  as  follows: 


Consider  two  one-dimensional,  real-valued  sequences  hi,hn,  where  hi  is 
obtained  from  h  u  by  shifting  some  subintervals  according  to  the  “disparity  se- 
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quence”  d: 


with 


hL{i)  =  hll(i  +  di), 

di  6  Q  =  {  -m,  -m  +  1, . . . ,  -1,0, 1, . . . ,  m). 


(50) 

(51) 


The  signal  matching  problem  is  to  find  d  given  h i,  h R.  (In  a  more  realistic 
situation,  we  do  not  observe  h.L,h.R  directly,  but  rather  some  noise-corrupted 
versions  <7l,<7r).  Some  interesting  instances  of  this  problem  are  the  match¬ 
ing  of  stereoscopic  images  along  epipolar  lines  (Marr  and  Poggio,  1976);  the 
computation  of  the  dip  angle  of  geological  structures  from  electrical  resistivity 
measurements  taken  along  a  bore  hole,  and  the  matching  of  DNA  sequences. 

To  make  the  discussion  more  specific,  we  will  consider  a  simple  example, 
in  which  the  sequences  hi,hR  are  binary  Bernoulli  sequences;  we  will  assume 
that  the  noise  corruption  process  can  be  modeled  as  a  binary  symmetric  channel 
with  known  error  rate,  and  that  d  is  known  to  be  a  piecewise  constant  function. 
A  well  known  instance  of  this  problem  is  the  matching  of  a  row  of  a  random  dot 
stereogram  with  density  p  (Julesz  (I960)),  when  the  components  of  the  stereo 
pair  are  corrupted  by  noise. 


The  stochastic  model  for  the  observations  is  then  constructed  by  assuming 
that  the  right  image  is  a  sample  function  of  a  Bernoulli  process  A  with  parameter 
p\  i.e., 

gR(i)  =  A(i).  (52) 

The  left  image  is  assumed  to  be  formed  from  the  right  one  by  shifting  it  by  a 
variable  amount  given  by  the  disparity  function  d,  except  at  some  points  where 
an  error  is  committed  with  probability  e.  Note  that  some  regions  that  appear  in 
the  right  image  will  be  occluded  in  the  left  one  (see  Figure  8).  The  “occlusion 
indicator”  <j>d  can  be  computed  deterministically  from  d  in  the  following  way: 


J  1,  if  di_k  >  di  +  k,  for  some  integer  k  €  (0,  m) 
\  0,  otherwise. 


(53) 


The  occluded  areas  are  assumed  to  be  “filled  in”  by  an  independent  Ber¬ 
noulli  process  B.  The  final  model  is  then: 

{gR{i  +  di)  with  prob.  1  —  e,  if  4>d{i)  =  0 

1  -  4ft(*  +  </,)  with  prob.  e,  if  <j>d(i)  =  0  (54) 

Bp(i)  with  prob.  1,  if  <f>d(i )  =  1. 

Note  that  in  the  two  dimensional  case,  the  index  i  denotes  a  site  of  a  lattice, 
and  therefore  it  can  be  represented  as  a  two-vector  ( * i ,  * 2 )  whose  components 
denote  the  column  and  row  of  the  site,  respectively.  To  simplify  the  notation, 
we  will  adopt  the  following  convention  throughout  this  section:  when  a  scalar 
is  added  to  this  vector  index  (as  in  gi;(i  +  d,)  and  <71  +  t  ),  it  will  be  implicitly 
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Figure  8.  Occluded  Regions:  The  horizontal  and  vertical  axis  represent  points  in  one 
row  of  the  left  and  right  images,  respectively.  Matching  points  are  represented  by  black 
circles.  Any  match  in  the  shaded  region  will  occlude  the  point  i. 


assumed  that  it  is  multiplied  by  the  vector  (1,0)  (so  that  the  above  expressions 
should  be  understood  as  gR(i  4-  (d^O))  and  di+(*,o),  respectively).  Using  this 
convention,  the  observation  model  of  Equation  (27)  can  be  applied  either  to  the 
one  or  to  the  two-dimensional  cases. 


Notice  that  even  if  the  observations  are  noise-free  (e  =  0)  the  solution  of 
the  problem  remains  ambiguous,  and  it  cannot  be  uniquely  determined  unless 
some  prior  knowledge  about  d  (for  example,  in  the  form  of  a  MRF  model)  is  in¬ 
troduced.  The  use  of  a  MRF  model  in  the  stereo  matching  case  corresponds  to  a 
quantification  of  the  assumption  of  the  existence  of  “dense  solutions”  (this  term 
was  introduced  by  Julesz  (1960),  and  essentially  corresponds  to  the  assumption 
that  the  disparity  d  varies  smoothly  in  most  parts  of  the  image;  see  also  Marr 
and  Poggio  (1979)),  and  the  use  of  the  occlusion  indicator  corresponds  to  the 
“ordering  constraint”  (i.e.,  the  requirement  that  if  i  >  j,  then  t  +  d,  >  j  +  d}, 
see  Baker  (1981);  we  put  4>j  =  l  whenever  this  constraint  is  violated). 


To  formulate  the  estimation  problem,  we  will  consider  the  sequence  gi  as 
“observations,”  while  gR  will  play  the  role  of  a  set  of  parameters.  Thus,  from 
(27),  we  have  (assuming,  for  simplicity  that  p  =  y): 


P[gi{i)  =  k\d,  gn)  =  Pg\d(k)  = 


1  -  e  if  4>d(i)  =  0  and  gR(i  +  d,)  —  k 
if  <t>d(i)  =  0  and  gR(i  +  d,)  ^  k 
if  <t>d(i)  =  1 


(55) 


As  a  prior  model  for  the  disparity  field,  we  may  use  a  first  order  MRF  with 
generalized  Ising  potentials,  such  as  the  one  presented  in  Section  5.1.  Other 
models  may  also  be  used,  including  the  coupled  depth  and  line  fields  that  we 
discussed  in  the  previous  section.  For  the  present,  let  us  assume  that  the  sim¬ 
pler  Ising  model  is  adequate.  Note  that  even  when  the  matching  problem  is 
one-dimensional  (we  are  assuming  that  there  is  no  vertical  disparity  between 
the  images,  so  that  the  matching  can  be  done  on  a  row-by-row  basis),  the 
two-dimensional  nature  of  the  prior  MRF  model  for  the  disparity  introduces  a 
coupling  between  matches  at  adjacent  rows.  The  posterior  energy  is 

Up(d-.  g)  =  Y  Y.  V(di,  d,)  +  i  Y.  Mi) ln2+ 

0  i.j  » 

+  ^  ~~  M*)M<7l.(*)  -  9r(i  +  d,)),  (56) 

t 

where 

«  =  ln(r^7).  (57) 

It  is  possible  to  apply  the  general  Monte  Carlo  algorithms  presented  above 
to  approximate  the  optimal  estimate  d  with  respect  to  a  given  performance 
measure  (such  as  the  mean  squared  error).  Their  use  in  this  case,  however,  is 
complicated  by  the  introduction  of  the  occlusion  function  <f>d  in  the  posterior 
energy:  the  size  of  the  support  for  this  function  equals  the  total  number  of  al¬ 
lowed  values  for  the  disparity  (see  Equation  (36)).  If  this  number  is  large,  the 
computation  of  the  increment  in  energy,  or  of  the  conditional  distributions  (if 
the  Gibbs  Sampler  is  used)  may  be  quite  expensive.  In  many  cases,  however, 
the  size  of  the  regions  of  constant  disparity  is  relatively  large  compared  with  the 
size  of  the  occluded  areas.  In  these  cases,  one  can  approximate  the  posterior 
energy  by: 

UP(d)  =  ^Y,  V^di)  +  |  E  -  0«(*  +  di))  (58) 

and  increase  significantly  the  computational  efficiency.  Notice  that,  this  approx¬ 
imate  functional  is  very  similar  to  the  functional  suggested  by  standard  regu¬ 
larization  (compare  Table  2).  It  is  also  possible,  particularly  for  the  high  signal 
to  noise  ratio  case,  to  design  deterministic,  highly  distributed  algorithms  for  the 
efficient  computation  of  the  optimal  estimator.  The  details  of  these  designs  can 
be  found  in  Marroquin,  1985. 

To  illustrate  the  performance  of  this  approach,  we  present  in  Figure  9  a 
random  dot  stereogram  portraying  a  square  floating  over  a  uniform  background 
(panel  (a)),  and  the  reconstructed  surface  (panel  (b)). 
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(a) 


(b) 


Figure  9.  (a)  Random  dot  stereogram,  (b)  Reconstructed  surface. 


7.  Parallel  Implementations. 

7.1.  Connection  Machine  Architectures. 


The  general  Monte  Carlo  procedure  that  we  have  presented  for  the  approxima¬ 
tion  of  the  optimal  Bayesian  estimators  of  MRF’s  can  be  greatly  accelerated 
if  it  is  implemented  in  a  parallel  architecture.  A  necessary  condition  for  the 
convergence  of  the  probability  measures  of  the  Markov  chains  defined  by  the 
Metropolis,  Gibbs  Sampler,  or  heat  bath  algorithms  to  the  posterior  Gibbs  dis¬ 
tribution  (and  therefore,  for  the  convergence  of  the  approximations  given  by 
Equations  (22)  and  (23)  to  the  desired  estimates)  is  that  if  two  sites  belong  to 
the  same  clique,  they  are  never  updated  at  the  same  time.  It  is  important  to 
note,  however,  that  this  condition  is  also  sufficient  only  for  the  case  of  the  Gibbs 
sampler  and  heat  bath  algorithms:  if  one  updates  simultaneously  the  states  of 
all  non-neighboring  sites,  the  regularity  of  the  resulting  Metropolis  chain  will 
be  destroyed,  so  that  it  will  no  longer  be  possible  to  guarantee  the  convergence 
of  the  Metropolis  algorithm  to  the  desired  result  (see  Note  [4]). 


If  one  implements  the  Gibbs  sampler  in  a  parallel  architecture  in  which  a 
processor  is  assigned  to  each  site,  the  total  execution  time  will  be  reduced  by  a 
factor  of 


N_ 

K 


(59) 


where  K  is  the  so  called  “chromatic  number”  of  the  graph  that  describes  the 
iiiunhood  structure,  and  it  is  equal  to  the  minimum  number  of  colors  needed 
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to  color  the  sites  of  the  lattice  in  such  a  way  that  no  two  neighbors  are  the  same. 


An  example  of  such  a  massively  parallel  architecture  is  the  “Connection 
Machine”  computer  (Hillis,  1985),  built  by  Thinking  Machines  Corporation. 

This  machine  was  originally  conceived  at  the  Artificial  Intelligence  labora¬ 
tory  at  MIT  and  was  further  developed  and  now  marketed  by  Thinking  Machines 
Corporation.  It  is  a  “Single  Instruction  Multiple  Data”  (SIMD)  array  proces¬ 
sor  consisting  in  the  version  presently  available  at  the  AI  Laboratory  of  64,000 
processing  units  (each  with  a  single  bit  Arithmetic/ Logical  unit)  organized  in  a 
four-connected  lattice  that  is  128  elements  square.  Besides  this  nearest-neighbor 
connectivity,  it  will  also  be  possible  (although  computationally  more  expensive), 
to  connect  any  two  processors  in  the  array  using  a  router  network. 

At  each  cycle  of  the  machine  an  instruction  is  executed  by  each  processor, 
and  a  single  bit  is  transmitted  to  its  neighbors.  This  means  that  the  updating 
scheme  can  be  implemented  most  efficiently  if  the  field  is  first  order  Markov, 
but  higher  order  processes  can  also  be  implemented  without  using  the  router  by 
successively  propagating  the  transmitted  state  (the  execution  time,  therefore, 
will  grow  linearly  with  the  order  of  the  field). 

To  make  this  discussion  more  concrete,  consider,  as  an  example,  the  prob¬ 
lem  of  finding  the  optimal  estimate  for  an  M-ary,  first  order  MRF  with  Ising 
potentials  (i.e.,  the  segmentation  of  a  piecewise  constant  image)  from  noisy 
observations.  Let  us  assume  that  the  estimator  is  to  be  implemented  in  the 
Connection  Machine  system,  and  suppose  that  by  the  use  of  appropriate  scaling 
factors,  all  the  numbers  can  be  represented  as  16-bit  integers.  We  will  use  the 
following  conservative  assumptions:  We  assume  that  16  cycles  of  a  single  1-bit 
processor  are  needed  to  perform  16-bit  addition,  subtraction  or  comparison; 
162  cycles  to  perform  multiplication  or  division;  2  x  162  cycles  for  generating  a 
pseudo-random  number  with  uniform  distribution  on  a  given  interval;  16  cycles 
for  memory  transfer  operations,  and  6  x  162  cycles  for  computing  an  exponential. 

Assuming  that  we  run  250  iterations  of  the  system,  and  ignoring  the  over¬ 
head  time  we  get  that 

Exec.  Time  «  1.4 (M  —  1)106  cycles  (60) 

For  the  particular  case  of  binary  images,  we  have  developed  a  deterministic 
scheme  for  which  this  execution  time  can  be  reduced  by  an  order  of  magnitude 
(see  Marroquin,  1985). 

In  the  case  of  the  reconstruction  of  piecewise  smooth  functions  from  sparse 
data,  the  optimal  estimator  can  also  be  implemented  in  this  machine.  To  study 
this  implementation,  we  first  note  that  the  chromatic  numbers  of  the  graphs  as¬ 
sociated  with  the  line  and  depth  neighborhood  systems  are  4  and  2,  respectively, 
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Figure  10.  (a)  Coloring  of  the  coupled  line-depth  lattice,  (b)  and  (c)  Elements  whose 
state  is  stored  in  each  of  the  two  types  of  processors  of  a  4-connected  parallel  archi¬ 
tecture. 

which  means  that  the  coupled  process  has  a  chromatic  number  of  6.  In  Figure 
10  (a)  we  illustrate  one  possible  “coloring.” 

The  colors  of  the  line  process  are  represented  by  the  numbers  1,2, 3, 4,  and 
those  of  the  depth  process  by  white  and  black  circles.  The  updating  process 
can  be  implemented  in  a  4-connected  architecture  by  assigning  one  processor 
to  each  depth  site  and  its  four  adjacent  line  elements.  We  will  thus  have  two 
different  populations  of  processors,  whose  configurations  are  shown  in  Figures  9 
(b)  and  (c),  respectively. 

Each  complete  iteration  consist  on  6  major  cycles:  in  the  first  two,  the  state 
of  the  white  and  black  depth  variables  is  respectively  updated,  and  in  the  next 
four,  the  new  states  of  the  binary  line  variables  stored  in  (say)  the  white  proces¬ 
sors  are  successively  computed  and  transmitted  to  the  corresponding  memory 
locations  of  the  neighboring  black  processors.  Note  that  in  this  scheme  we  have 
some  redundancy  in  the  use  of  memory  (each  binary  variable  is  stored  twice), 
but  the  state  of  all  the  elements  needed  for  each  updating  operation  is  always 
available  from  adjacent  processors.  The  Monte  Carlo  algorithm  requires  about 
200  iterations  to  converge.  As  before,  we  have  also  developed  in  this  case  a 
deterministic  scheme  with  very  good  experimental  performance,  for  which  the 
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execution  time  can  be  reduced  by  at  least  an  order  of  magnitude.  E.  Gamble 
has  now  implemented  several  MRF  models  on  the  Connection  Machine  system 
at  the  A1  Laboratory  as  part  of  the  Vision  Machine  project  (Poggio  et  al.,  1987). 

7.2.  Hybrid  Analog*Digital  Computers  and  Hopfleld  Networks 

As  we  mentioned  in  section  4.2.1,  the  reconstruction  of  piecewise  continuous 
functions  can  be  achieved  by  coupling  two  MRF’s,  one  corresponding  to  the 
continuous  field  and  the  other  to  the  discontinuities.  From  this  scheme  we  have 
suggested  a  special  purpose  parallel  computer  consisting  of  an  analog  network  of 
resistances  -  corresponding  to  the  continuous  intensity  field  -  and  a  digital  net¬ 
work  -  corresponding  to  the  line  process,  coupled  via  D-A  and  A-D  converters. 
The  idea  suggested  by  computer  experiments  (Marroquin,  1985)  is  that  the  two 
processes  can  rim  on  different  time  scales,  a  slow  one  for  the  digital  part  and  a 
fast  one  for  the  analog  network.  In  this  way  the  two  processes  are  effectively  de¬ 
coupled  and  the  continuous  field  finds  its  equilibrium  effectively  instantaneously 
after  each  update  of  the  line  process.  (Koch,  Marroquin  and  Yuille  (1985)  dis¬ 
cuss  implementations  of  this  idea.)  It  can  be  extended  to  multilayered  hybrid 
networks,  each  layer  corresponding  to  a  MRF  and  being  digital  or  analog  de¬ 
pending  on  the  continuous  or  binary  nature  of  the  field.  Hybrid  multilayered 
architectures  of  this  type  are  especially  attractive  for  implementing  the  fusion 
of  several  vision  processes. 

Finally,  we  mention  that  Koch,  Marroquin  and  Yuille  (1985)  have  been 
experimenting  successfully  with  a  special  type  of  analog  networks  Hopfield 
networks  -  whose  equilibrium  states  correspond  to  approximations  of  the  opti¬ 
mal  estimators. 


8.  Conclusions 

In  this  paper  we  have  presented  a  probabilistic  approach  to  the  solution  of  a 
class  of  perceptual  problems.  We  showed  that  these  problems  can  lie  l  educed  to 
the  reconstruction  of  a  function  on  a  finite  lattice  from  .i.  set  of  degraded  obser¬ 
vations,  and  derived  the  Dayesian  estimators  that  provide  an  optimal  solution. 
We  have  also  developed  efficient  distributed  algorithms  for  the  computation  of 
these  estimates,  and  discussed  their  implementation  in  different  kinds  of  hard¬ 
ware.  To  demonstrate  the  generality  and  practical  value  of  this  approach,  we 
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studied  in  detail  several  applications:  the  segmentation  of  noise-corrupted  im¬ 
age.'-:  the  reconstruction  of  piecewise  smooth  surfaces  from  sparse  data  and  the 
reconstruction  of  depth  from  stereoscopic  measurements. 

8.1.  Connection  with  Standard  Regularization 

The  maximum  a  posteriori  (MAP)  estimate  of  a  MRF  is  obviously  similar  to  a 
variational  principle  of  the  general  form  of  Equation  (7),  since  the  use  of  this 
criterion  defines  the  optimal  estimator  as  the  global  minimizer  of  the  posterior 
energy  Up  (Equation  (11)):  the  first  term  measures  the  discrepancy  between  the 
data  and  the  solution,  the  second  term  is  now  an  arbitrary  “potential”  function 
of  the  solution  (defined  on  a  discrete  lattice).  It  is  then  natural  to  ask  for  the 
connection  between  standard  regularization  principles  and  the  MRF  approach. 
It  turns  out  (from  Equation  (8))  that  a  MAP  estimate  leads  to  the  minimization 
of  a  functional  Up  (see  Equation  (9))-  in  general  not  quadratic  -  that  reduces 
to  a  quadratic  functional,  of  the  standard  regularization  type,  when  the  MRF 
is  continuous-valued,  the  noise  is  additive  and  gaussian  (the  term  T2  $,(  f,g,) 
will  be  quadratic)  and  first  order  differences  of  the  field  are  zero-mean,  indepen¬ 
dent,  gaussian  random  variables  (thus  the  a  priori  probability  distribution  is  a 
Gibbs  distribution  with  quadratic  potentials  so  that  the  term  the  term  Uq ( / )  is 
quadratic). 

8.2.  The  Fusion  Problem 

This  approach  also  permits,  in  principle,  the  incorporation  of  more  than  one 
modality  of  observations  into  a  single  estimation  process,  as  well  as  the  simul¬ 
taneous  estimation  of  several  related  functions  from  the  same  data  set.  This 
makes  one  hope  that  this  framework  could  be  useful  in  the  solution  of  difficult 
problems  that  require  such  an  integrated  approach. 

For  instance,  the  stereo  matching  problem  in  real  situations  has  not  been 
solved  yet  in  a  completely  satisfactory  way.  The  same  can  be  said  of  other 
related  perceptual  problems  such  as:  edge  detection;  image  segmentation;  the 
recovery  of  the  shape  of  an  object  from  a  single  two-dimensional  image  (the 
“shape  form  shading"  problem),  and  the  segmentation  of  a  scene  into  distinct 
objects,  as  well  as  the  recovery  of  their  three-dimensional  structure  from  the 
analysis  of  images  formed  at  successive  instants  of  time  (the  “structure  from 
motion"  problem).  All  these  problems  are  obviously  related,  and  it  is  intuitively 
clear  that  the  individual  solutions  that  can  be  obtained  should  improve  if  the 
mutual  constraints  that  the  solution  of  each  individual  problem  imposes  on  the 


others  were  taken  into  account.  Thus,  the  presence  of  a  brightness  edge  should 
increase  the  likelihood  of  a  depth  edge,  and  vice  versa;  the  depth  estimated 
from  stereo  should  be  compatible  with  the  shape  derived  from  shading;  points 
belonging  to  the  same  region  in  an  image  should  move  together,  etc.  We  be¬ 
lieve  that  these  constraints  can  be  incorporated  in  the  potential  functions  of  the 
corresponding  MRF  models,  so  that  the  combined  optimal  estimation  process 
represents,  in  fact,  an  integrated  cooperative  solution  to  these  problems,  with 
a  significantly  improved  performance.  Recently,  Poggio  has  discussed  how  to 
use  coupled  MRF  models  to  integrate  information  from  different  vision  modules 
(Poggio,  1985).  Gamble  and  Poggio  (in  Poggio  et  al.,  1987)  have  implemented 
efficient  algorithms  to  integrate  stereo,  motion  and  intensity  information  to  ob¬ 
tain  a  robust  description  of  the  discontinuities  in  the  scene. 


9.  Notes 


[1]:  It  is  computationally  unfeasible  to  perform  the  maximization  of  the  like¬ 
lihood  function  L  directly,  due  to  the  extraordinary  complexity  of  P(g\a,To): 


E/«p[-£M/;sl a,7j)] 

P(slo.To)=  = — 


where  lrp  is  given  by  Equation  (9).  However,  the  form  of  the  “complete  data” 
distribution  P(f,g\a.To)  (the  so-called  “regular  exponential  family  form”;  see 
Dempster  et  al.,  1977)  is  such  that  at  every  local  maximum  of  L  we  liave  that: 

E[UQ\g,a,To]  =  E[Uo\a,To] 

E[i\g ,  a,  T0]  =  E[e\a,  T0]  (62) 

where  e  and  Uo  are  defined  by  Equations  (6)  and  (7). 


Note  that  both  the  left  and  the  right  hand  sides  of  the  above  equations  can 
be  approximated  using  the  Monte  Carlo  procedure  described  in  Section  4  (us¬ 
ing  the  posterior  and  prior  energy,  respctively),  and  that  the  right  hand  side 
is  independent  of  the  observations.  These  relations  form  the  basis  of  the  EM 
algorithm. 

[2]:  Since  both  the  random  field  /  and  the  noise  process  are  stationary, 
we  have  that 

e[(u,  -  e[umM?\  =  #ofclil,JoflT^,t^ 

so  that  this  assumption  becomes  asymptotically  correct  for  large  lattices. 


(63) 


[3]s  Consider  an  M-ary  field  /  with  Ising  potentials,  corrupted  with  0- 
inean,  additive  white  Gaussian  noise  with  variance  a2  <  Suppose  that 

fjtQ  =  {q  ■  q  =  qo  +  2 kd,  k  =  1,2, ...  ,M)  for  all  i.  (64) 

We  define  the  statistic  \Vg  as 

=  (65) 

«.> 

where  g  is  the  observation  process,  Nc  is  the  number  of  nearest-neighbor  pairs 
in  the  lattice, 

{-1,  if  g,  =  g}  and  |r  -  j\  =  1 

1,  if  g,  ^  gj  and  |i  -  j |  =  1  (06) 

if  I*  ~  j\  ^ 

and  gi  —  q0  4-  2nd,  with  n  an  integer  such  that 

9o  +  (2n  -  l)d  <  </,  <  0  +  (2n  +  1)0.  (67) 

Note  that  it  is  possible  that  gi  &  Q. 

Define 

i  r 

V?(r,  a)  =  — — /  exp[— x2 /2a2\dx.  (68) 

V27T<T  Joo 

It  is  not  difficult  to  see  that 

E[W,\o.  To)  —  l  -  (A  +  B)  +  E[U0\To)(A  -  B)  (69) 

where  U0  =  U0/Nc ; 

A  =  Pr(W9(gi,gj)  =  -mfiJj)  =  -1), 

B  =  Pr(Wg(gi,gj)  =  -1|  V(fijj)  =  1)  (70) 

(note  that  E[Uo\T0]  =  ^(7o)  is  data  independent,  and  therefore,  it  can  be 
computed  off-line). 


Assuming  that 


Pr{\fi  ~  f}\  =  q\fi  *  fj) 


,  for  <7  =  1,2, —  1, 


M  -  1 

we  can  approximate  .4  and  B  by: 

A{a)  =  6a2  -f-  462  —  4a6  —  4a  +  1 

B(<r)  =  —■  A-3 a2  -  3 b2  +  2ab  +  2a), 

M  —  1 


(71) 


(72) 


where  a  —  ,p(d,&)  and  b  =  4>(3d,  cr).  (The  above  approximation  has  been  com¬ 
puted  assuming  that  4>{bd,ama.x)  «  0.  If  this  is  not  true,  more  terms  can  easily 
be  included.) 


Assuming,  as  before,  that 

E[Wq\a,To\  —  Wg  (computed  from  the  data), 


(73) 


36 


we  can  find  the  optimal  estimate  for  (a, To)  as  the  global  maximizer  of  an  ap¬ 
propriate  merit  function  along  the  curve 


T0  = 


A(a)  -  B(o) 
using  a  “composite  annealing”  strategy. 


(74) 


To  define  the  merit  function,  we  cover  the  lattice  with  a  set  of  non-overlapping 
squares  and  add  the  relative  variance  of  the  noise  parameter  over  each  square, 
thus: 

(75) 

(70  and  <7j  can  be  approximated  as  time  averages,  with  respect  to  the  Gibbs 
chain  with  cr  and  To  as  parameters,  of  the  estimated  noise  variance  over  the 
whole  lattice  and  over  the  jth  square,  respectively. 

[4]:  As  a  simple  example  for  which  the  regularity  of  the  Metropolis  chain 
is  destroyed,  consider  a  3  x  3  ginary  Ising  lattice  with  periodic  boundary  condi¬ 
tions.  It  is  easy  to  see  that  for  the  initial  state: 


1  0  1 

0  1  0.  (76) 

1  0  1 

The  Metropolis  algorithm,  either  with  lexicographic  updating  order,  or  with 
simultaneous  updating  of  all  non-neighboring  sites,  will  produce,  deterministi¬ 
cally,  the  sequence: 

10  1  0  10  10  1 

1  0  1^0  1  0  ->  0  1  0-f...  (77) 

0  10  10  110  1 

for  any  finite  temperature. 
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